BTS 510 Lab 6

set.seed(12345)
library(tidyverse)

Warning: package 'purrr' was built under R version 4.5.1

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.2     ✔ tibble    3.3.0
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.1.0     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(Stat2Data)
theme_set(theme_classic(base_size = 16))

0.1 Learning objectives

Describe the logic of linear regression
Describe the assumptions of linear regression
Briefly present results of linear regression
Use a regression equation to summarize a model
Compare and contrast observed, predicted, and residual values

0.2 Data

ICU data from the Stat2Data package
- ID: Patient ID code
- Survive: 1 = patient survived to discharge or 0 = patient died
- Age: Age (in years)
- AgeGroup: 1 = young (under 50), 2 = middle (50-69), 3 = old (70+)
- Sex: 1 = female or 0 = male
- Infection: 1 = infection suspected or 0 = no infection
- SysBP: Systolic blood pressure (in mm of Hg)
- Pulse: Heart rate (beats per minute)
- Emergency: 1 = emergency admission or 0 = elective admission

0.3 Tasks

Does blood pressure differ between those with a suspected infection and those without? Fit a linear regression to answer this question.

library(Stat2Data)
data(ICU)
m1 <- lm(data = ICU, SysBP ~ Infection)
summary(m1)


Call:
lm(formula = SysBP ~ Infection, data = ICU)

Residuals:
    Min      1Q  Median      3Q     Max 
-87.452 -18.672  -2.062  18.548 117.328 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  138.672      2.986  46.440  < 2e-16 ***
Infection    -15.220      4.608  -3.303  0.00113 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 32.16 on 198 degrees of freedom
Multiple R-squared:  0.05223,   Adjusted R-squared:  0.04744 
F-statistic: 10.91 on 1 and 198 DF,  p-value: 0.001134

Write up the interpretation of the model, including:

Intercept value, test statistic, p-value
Intercept interpretation (in words)
Slope value, test statistic, p-value
Slope interpretation (in words)
R^2 value, test statistic, p-value
R^2 interpretation (in words)

What is the predicted blood pressure for someone with a suspected infection? What is the predicted blood pressure for someone without a suspected infection?

Predict function

newdat <- data.frame(Infection = c(0,1))
predict(m1, newdat)

       1        2 
138.6724 123.4524

Use R as a smart calculator

bp_0 <- summary(m1)$coefficients[1] + summary(m1)$coefficients[2] * 0
bp_0

[1] 138.6724

bp_1 <- summary(m1)$coefficients[1] + summary(m1)$coefficients[2] * 1
bp_1

[1] 123.4524

Use R as a dumb calculator

bp_0d <- 138.672 - 15.220 * 0
bp_0d

[1] 138.672

bp_1d <- 138.672 - 15.220 * 1
bp_1d

[1] 123.452

Calculate the predicted and residual values for each person and add them to the original dataset. Does it looks like the assumptions are satisfied? Make plots to help you decide:

Residual vs predictor
Q-Q plot of residuals
Histogram of residuals (all together and separately for the two groups)

ICU$pred1 <- fitted(m1)
ICU$resi1 <- residuals(m1)

ggplot(data = ICU, 
       aes(x = Infection, y = resi1)) + 
  geom_point(alpha = 0.3, 
             size = 2) +
  geom_smooth(method = "lm")

`geom_smooth()` using formula = 'y ~ x'

ggplot(data = ICU,
       aes(sample = resi1)) + 
       stat_qq() +
       stat_qq_line()

ggplot(data = ICU,
       aes(x = resi1)) +
  geom_histogram()

`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

ggplot(data = ICU,
       aes(x = resi1)) +
  geom_histogram() +
  facet_grid(cols = vars(Infection))

`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Does blood pressure differ between those with a suspected infection and those without? How? (Brief, simple words, no jargon, no statistics.)

1 Plot.

int <- summary(m1)$coefficients[1]
slo <- summary(m1)$coefficients[2]
ggplot(data = ICU,
       aes(x = Infection, y = SysBP)) +
  geom_point(alpha = 0.3) +
  annotate("text", 
    x = 0.5, 
    y = 250, 
    label = bquote(~ hat(Y) == .(round(int, 3)) + .(round(slo, 3))~"* Infection"),
    size = 6)

Warning in is.na(x): is.na() applied to non-(list or vector) of type 'language'